Implement instructions for matching builtin character classes and assertions #547

rctcwyvrn · 2022-07-05T23:49:34Z

Adds matchBuiltin to match builtins instead of a consumeFn calling _CharacterClassModel.matches in most cases
Removes AssertionFunction and changes assertBy to match in Processor instead

There are some gains (10% in microbenchmarks on builtin character classes) but the main purpose is to set up for future work

Future work:

The layout of the payloads for matchBuiltin and assertBy will be optimized (hopefully soon) along with the rest of the instruction layout. Changes made in this branch and the scalar optimization branch showed that this is a strong next step
I wasn't able to fully nuke _CharacterClassModel.matches but I think we're getting very close to being able to rip out most of it and the consumer interface for CustomCharacterClass

rctcwyvrn · 2022-07-06T00:02:19Z

WIP because I need to figure out these weird performance behaviours

Weird thing number 1: The first commit in this branch (copying over the AsciiBitset changes from the matchScalar PR) is resulting in significant speedups

=== Improvements =====================================================================
- emailLookaheadNoMatchesAll              58ms	63ms	-5.05ms		-8.0%
- emailLookaheadNoMatchesFirst            58ms	63ms	-4.95ms		-7.9%
- emailLookaheadAll                       81.9ms	86.8ms	-4.95ms		-5.7%
- basicCCC                                9.63ms	10.7ms	-1.08ms		-10.1%
- caseInsensitiveCCC                      11.3ms	12.3ms	-1.05ms		-8.6%
- invertedCCC                             28.5ms	29.4ms	-946µs		-3.2%
- emailRFCAll                             50.2ms	51.1ms	-942µs		-1.8%
- basicRangeCCC                           10.6ms	11.4ms	-782µs		-6.9%

I sanity checked this by trying to minimize the change (not copying it over to a new file, not adding the new scalar api) and got the same result.

diff --git a/Sources/_StringProcessing/ConsumerInterface.swift b/Sources/_StringProcessing/ConsumerInterface.swift
index af46b53..9c41002 100644
--- a/Sources/_StringProcessing/ConsumerInterface.swift
+++ b/Sources/_StringProcessing/ConsumerInterface.swift
@@ -11,6 +11,13 @@
 
 @_implementationOnly import _RegexParser
 
+extension Character {
+   var _singleScalarAsciiValue: UInt8? {
+     guard self != "\r\n" else { return nil }
+     return asciiValue
+   }
+ }
+

before
      internal func matches(char: Character) -> Bool {
        let ret: Bool
        if let val = char.asciiValue {
          if val < 64 {
            ret = (a >> val) & 1 == 1
          } else {
            ret =  (b >> (val - 64)) & 1 == 1
          }
        } else {
          ret = false
        }

        if isInverted {
          return !ret
        }

        return ret
      }
after
      private func matches(_ val: UInt8) -> Bool {
        if val < 64 {
          return (a >> val) & 1 == 1
        } else {
          return (b >> (val - 64)) & 1 == 1
        }
      }

       internal func matches(char: Character) -> Bool {
         let matched: Bool
         if let val = char._singleScalarAsciiValue {
           matched = matches(val)
         } else {
           matched = false
         }

         if isInverted {
           return !matched
         }
         return matched
       }

Some quick profiling seems to indicate that the performance increase came the call to _singleScalarAsciiValue being faster than the call to asciiValue, even though _singleScalarAsciiValue also uses asciiValue??? My best guess is that it has to do with Character.asciiValue.getter being inlined before the change but not after, though I honestly don't know why that would make a difference

rctcwyvrn · 2022-07-06T00:08:15Z

Weird performance thing number 2: The optimization is causing regressions in regexes that do not produce the new instruction and very little improvement in the ones that do

Comparing with the first commit on the branch we get larger regressions in all of the email regexes and smaller ones in almost every other result

Comparing against benchmark result file sanity_check_after_bitset.json
=== Regressions ======================================================================
- htmlFirst                               26.9µs	26.7µs	209ns		0.8%
- emailLookaheadFirst                     18.5µs	18.2µs	250ns		1.4%
- emailRFCFirst                           8.38µs	7.92µs	458ns		5.8%
- cssFirst                                93µs	91.9µs	1.17µs		1.3%
- HangulSyllableFirst                     4.06ms	4.01ms	47.8µs		1.2%
- cssAll                                  4.87ms	4.82ms	47.9µs		1.0%
- caseInsensitiveCCC                      11.5ms	11.4ms	107µs		0.9%
- ReluctantQuantWithTerminalWhole         10.6ms	10.4ms	113µs		1.1%
- basicRangeCCC                           10.7ms	10.5ms	126µs		1.2%
- HangulSyllableAll                       8.41ms	8.24ms	167µs		2.0%
- htmlAll                                 13.5ms	13.3ms	205µs		1.5%
- subtractionCCC                          15.3ms	15.1ms	278µs		1.8%
- intersectionCCC                         15.8ms	15.5ms	278µs		1.8%
- notFoundAll                             15.1ms	14.8ms	285µs		1.9%
- notFoundFirst                           15.1ms	14.8ms	298µs		2.0%
- symDiffCCC                              39.2ms	38.3ms	885µs		2.3%
- invertedCCC                             29.5ms	28.6ms	937µs		3.3%
- emailLookaheadNoMatchesFirst            59.5ms	58.3ms	1.12ms		1.9%
- emailLookaheadNoMatchesAll              59.3ms	58.2ms	1.13ms		1.9%
- emailRFCAll                             51.5ms	50ms	1.53ms		3.1%
- emailLookaheadAll                       83.5ms	81.8ms	1.72ms		2.1%
- emailRFCNoMatchesAll                    110ms	107ms	3.03ms		2.8%
- emailRFCNoMatchesFirst                  110ms	107ms	3.06ms		2.8%
=== Improvements =====================================================================
- GraphemeBreakNoCapAll                   9.2ms	9.45ms	-257µs		-2.7%
- EagarQuantWithTerminalWhole             6.67ms	6.76ms	-83.9µs		-1.2%
- GraphemeBreakNoCapFirst                 79.2µs	82.8µs	-3.54µs		-4.3%

All tests are passing so I don't think it's because there's a bug somewhere
Checking the compilation does confirm that neither email regex is generating the new instruction
The current benchmark suite doesn't have many benchmarks that mostly use builtin character classes, most mostly use custom classes and only use the builtins for whitespace. Need to add new microbenchmarks and real world benchmarks that use builtins.

rctcwyvrn · 2022-07-06T00:19:07Z

Removing the remnants of the bitset fast path optimization I was testing seems to have reverted most of the regressions

Comparing against benchmark result file sanity_check_after_bitset.json
=== Regressions ======================================================================
- emailRFCFirst                           8.17µs	7.92µs	249ns		3.1%
- invertedCCC                             28.8ms	28.6ms	167µs		0.6%
- emailLookaheadAll                       82.3ms	81.8ms	446µs		0.5%
- emailRFCAll                             50.5ms	50ms	528µs		1.1%
=== Improvements =====================================================================
- GraphemeBreakNoCapAll                   9.02ms	9.45ms	-430µs		-4.6%
- caseInsensitiveCCC                      11ms	11.4ms	-375µs		-3.3%
- basicCCC                                9.54ms	9.9ms	-359µs		-3.6%
- ReluctantQuantWhole                     14ms	14.2ms	-238µs		-1.7%
- EagarQuantWithTerminalWhole             6.54ms	6.76ms	-216µs		-3.2%
- ReluctantQuantWithTerminalWhole         10.3ms	10.4ms	-184µs		-1.8%
- htmlAll                                 13.1ms	13.3ms	-137µs		-1.0%
- intersectionCCC                         15.4ms	15.5ms	-121µs		-0.8%
- basicRangeCCC                           10.5ms	10.5ms	-77.5µs		-0.7%
- cssAll                                  4.74ms	4.82ms	-74µs		-1.5%
- GraphemeBreakNoCapFirst                 78.5µs	82.8µs	-4.25µs		-5.1%
- cssFirst                                90.8µs	91.9µs	-1.13µs		-1.2%
- htmlFirst                               26.4µs	26.7µs	-291ns		-1.1%

But why?

Why did an extra computed property on BuiltinCC and an extra computation when emitting a matchBuiltin affect all these other benchmarks????
Why are the email regexes still regressing?? (albeit only a little bit now, it could just be variance from my machine)
Why do we now have improvements on regexes that don't have builtins (all the CCC and reluctant quant benchmarks) ???

Need to make 100% sure that compile time is not affecting the benchmark times.

A lot of this weirdness could also just be due to inaccurate timing from the benchmarker and variance because I'm running everything locally

rctcwyvrn · 2022-07-12T01:06:27Z

@swift-ci test

rctcwyvrn · 2022-07-12T01:14:26Z

Sources/_StringProcessing/Engine/Processor.swift

+    case .verticalWhitespace:
+      matched = c.isNewline && (c.isASCII || !isStrictAscii)
+    case .newlineSequence:
+      matched = c.isNewline && (c.isASCII || !isStrictAscii)


This was carried over from the old code, but why does this match a full \r\n sequence when in unicode scalars mode? Should we change that?

It's what is specified in the Unicode proposal:

Additionally, \R and CharacterClass.newline provide a way to include the CR+LF pair, even when matching with Unicode scalar semantics.

I don't know the exact rationale behind it, but I would assume it's done in an effort to be compatible with other regex engines that match scalars by default but with \R still matching \r\n.

rctcwyvrn · 2022-07-12T01:15:07Z

Sources/_StringProcessing/Engine/Processor.swift

+          return currentPosition == subjectBounds.upperBound
+        }
+
+      case .wordBoundary:


There seem to be no test cases that use the usesSimpleUnicodeBoundaries option, should we be testing that?

We definitely should be.

@Azoy?

We have a couple: testSimpleWordBoundaries and https://github.com/apple/swift-experimental-string-processing/blob/96fb215facf425226ca0645d5cab6e84a3befece/Tests/RegexBuilderTests/RegexDSLTests.swift#L540 but the latter was xfail, I have to fix the test because the test is wrong.

rctcwyvrn · 2022-07-12T01:18:37Z

@swift-ci test

Sources/_StringProcessing/Engine/InstPayload.swift

Sources/_StringProcessing/Engine/Processor.swift

milseman · 2022-07-12T17:12:43Z

Sources/_StringProcessing/Engine/Processor.swift

+      switch semanticLevel {
+      case .graphemeCluster:
+        return input.index(after: currentPosition) == subjectBounds.upperBound
+         && input[currentPosition].isNewline


CC @Azoy , these are more examples of a consume operation that we want for more efficient access to String. Something like func nom() -> (current: Character, next: Index)

Sources/_StringProcessing/Engine/Processor.swift

milseman · 2022-07-12T17:15:57Z

Sources/_StringProcessing/Engine/Processor.swift

+          return currentPosition == subjectBounds.upperBound
+        }
+
+      case .wordBoundary:


We definitely should be.

@Azoy?

Sources/_StringProcessing/Engine/Processor.swift

Sources/_StringProcessing/Unicode/WordBreaking.swift

rctcwyvrn · 2022-07-16T02:00:23Z

I may have gotten a little carried away while updating this branch to match main (internalization of _CharacterClassModel and changes to assertions) and I ended up implementing a new DSLTree node for character classes

It seems to all be working but there's a slight hiccup in inverting character classes that I'll work out on Monday

rctcwyvrn · 2022-07-16T02:07:12Z

@swift-ci test

- matchBuiltin always fails if at endIndex - fix switch in isStrictAscii

rctcwyvrn · 2022-07-19T01:01:57Z

@swift-ci test

milseman

In progress review, but plenty to communicate. Looks great! Lots of little tweaks and clarifications but this is awesome stuff!

Sources/_StringProcessing/ConsumerInterface.swift

milseman · 2022-07-25T13:52:05Z

Sources/_StringProcessing/ConsumerInterface.swift

@@ -162,6 +162,8 @@ extension DSLTree.Atom {
    case .assertion:
      // TODO: We could handle, should this be total?
      return nil
+    case .characterClass(let cc):
+      return cc.generateConsumer(opts)


Future: get rid of all these consumers except custom ones like custom character classes.

The main reason why all these consumer functions + _CharacterClassModel.matches still exist is because we have to emit any non-ascii CustomCharacterClass as one big consumer so it needs to know how to create a consumer for each possible member type (which includes Atom).

Do you think we could emit it as an ordered choice over its members instead? I think the main thing to add would be instructions for handling the custom character class set operations, and if we had that then we could rip out all of the non character property related stuff in ConsumerInterface

I'd like to avoid emitting backtracking traffic for custom character classes if we can. However, since there's a compilation boundary around each member anyways, that might not account for much. Let's keep this for now and try to come back to it again

Another thing to consider is that if we keep non-ascii custom character classes as one big consumer we're unable to access any optimized matching for any of its members.

Also just no longer having all the matching logic exactly duplicated across ConsumerInterface and Processor seems like something pretty important for future extensibility/preventing bugs. We pay some cost for emitting backtracking traffic but this is mostly for uncommon cases, the most common ascii cases are already handled by the bitset.

Sources/_StringProcessing/Engine/InstPayload.swift

milseman · 2022-07-25T14:08:22Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

+  }
+
+  func isAtEndOfLine(_ payload: AssertionPayload) -> Bool {
+    if currentPosition == subjectBounds.upperBound { return true }


... similarly non-aligned upper bounds.

milseman · 2022-07-25T14:08:52Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

+      switch payload.semanticLevel {
+      case .graphemeCluster:
+        return input.index(after: currentPosition) == subjectBounds.upperBound
+         && input[currentPosition].isNewline


Similarly, alignment?

milseman · 2022-07-25T14:09:33Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

+    case .endOfSubject: return currentPosition == subjectBounds.upperBound
+
+    case .resetStartOfMatch:
+      fatalError("Unreachable, we should have thrown an error during compilation")


Future: we'll want to version support so that we can reason about back deployment

Sources/_StringProcessing/Engine/MEBuiltins.swift

- static vars in payloads - Clean up _CharacterClassModel - Use the model for bytecodegen and consumer interface - Merge the grapheme and scalar match builtin cases together

rctcwyvrn · 2022-08-03T00:49:57Z

@swift-ci test

rctcwyvrn added 4 commits July 5, 2022 14:21

Copy over new ascii bitset

3b6b676

Add matchBuiltin

33caa79

Remove debug prints

139daa5

Remove bitset fast path

9abf4af

rctcwyvrn marked this pull request as draft July 5, 2022 23:51

Fully remove remnants of the bitset fast path

286f5d8

rctcwyvrn added 2 commits July 7, 2022 16:59

Merge branch 'main' into speedy-builtins

9e915cd

Completely replace AssertionFunction with regexAssert(by:)

e593ddb

rctcwyvrn changed the title ~~[WIP] Implement instructions for matching builtin character classes~~ Implement instructions for matching builtin character classes and assertions Jul 12, 2022

rctcwyvrn marked this pull request as ready for review July 12, 2022 01:04

Merge branch 'main' into speedy-builtins

25dc277

rctcwyvrn commented Jul 12, 2022

View reviewed changes

Cleanup

3e38ac6

rctcwyvrn requested a review from milseman July 12, 2022 01:16

milseman reviewed Jul 12, 2022

View reviewed changes

rctcwyvrn added 5 commits July 12, 2022 12:19

Move match builtin and assert + Add AssertionPayload

e5d8b4a

Cleanup assertions

0466c25

Merge branch 'main' into speedy-builtins

87078ad

Fix tests

f401e84

Update opcode description for assertBy

b09f45f

rctcwyvrn mentioned this pull request Jul 14, 2022

[Optimization] Specialized quantification instruction #577

Merged

rctcwyvrn added 3 commits July 14, 2022 16:54

Merge branch 'main' into speedy-builtins

c581ea2

Merge branch 'main' into speedy-builtins

2a82231

Update branch to match main

fb1576a

rctcwyvrn added 2 commits July 15, 2022 17:35

Use the newly cleaned up _CharacterClassModel

3b9485e

Add characterClass DSLTree node

64d1ed9

rctcwyvrn requested a review from milseman July 16, 2022 02:07

hamishknight mentioned this pull request Jul 18, 2022

Add builtin character class atoms to the DSL tree #579

Closed

Bugfixes

2a6fe3c

- matchBuiltin always fails if at endIndex - fix switch in isStrictAscii

Add documentation for matchBuiltin

206bfc6

milseman reviewed Jul 25, 2022

View reviewed changes

rctcwyvrn added 4 commits July 25, 2022 12:31

Lots of cleanup

b53f524

- static vars in payloads - Clean up _CharacterClassModel - Use the model for bytecodegen and consumer interface - Merge the grapheme and scalar match builtin cases together

Move assertion payload

bb5245f

More minor cleanup

0746847

Perform boundary check for .anyScalar when in grapheme mode

c718543

rctcwyvrn mentioned this pull request Jul 26, 2022

Emit custom character classes like an alternation #590

Closed

rctcwyvrn added 2 commits July 27, 2022 17:17

Merge branch 'main' into speedy-builtins

d35b578

Merge branch 'main' into speedy-builtins

ca8acf2

rctcwyvrn merged commit 405fbcb into swiftlang:main Aug 3, 2022

Implement instructions for matching builtin character classes and assertions #547

Implement instructions for matching builtin character classes and assertions #547

Uh oh!

Conversation

rctcwyvrn commented Jul 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rctcwyvrn commented Jul 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rctcwyvrn commented Jul 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rctcwyvrn commented Jul 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rctcwyvrn commented Jul 12, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rctcwyvrn commented Jul 12, 2022

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rctcwyvrn commented Jul 16, 2022

Uh oh!

rctcwyvrn commented Jul 16, 2022

Uh oh!

rctcwyvrn commented Jul 19, 2022

Uh oh!

milseman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rctcwyvrn Jul 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rctcwyvrn commented Aug 3, 2022

Uh oh!

Uh oh!

rctcwyvrn commented Jul 5, 2022 •

edited

Loading

rctcwyvrn commented Jul 6, 2022 •

edited

Loading

rctcwyvrn commented Jul 6, 2022 •

edited

Loading

rctcwyvrn commented Jul 6, 2022 •

edited

Loading

rctcwyvrn Jul 25, 2022 •

edited

Loading